General Concepts
In machine learning, evaluating a model is a crucial step in understanding its performance. Two key concepts in this process are the hypothesis and the loss function.
Hypothesis
The hypothesis, typically denoted as $h_\theta$, represents the model chosen to predict outputs given certain input data. For an input $x^{(i)}$, the model prediction is $h_\theta(x^{(i)})$.
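As a quick illustration, here is a minimal sketch of a linear hypothesis $h_\theta(x) = \theta^\top x$ in Python; the linear form and the sample values of `theta` and `x` are illustrative assumptions, not part of the definition above.

```python
import numpy as np

# A minimal sketch of a linear hypothesis h_theta(x) = theta^T x.
# The linear form and the sample values below are illustrative assumptions.
def hypothesis(theta, x):
    return np.dot(theta, x)

theta = np.array([0.5, -1.0, 2.0])   # model parameters
x = np.array([1.0, 3.0, 0.5])        # one input example
print(hypothesis(theta, x))          # the model's prediction for x
```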
Loss Functions
Loss functions measure the difference between the actual values and the model's predictions. They are essential for training a model, providing feedback on its performance. Writing $z$ for the prediction and $y$ for the actual value, common loss functions include:
- Least Squared Error (for Linear Regression): $\frac{1}{2}(y - z)^2$
- Logistic Loss (for Logistic Regression): $\log\left(1 + \exp(-yz)\right)$
- Hinge Loss (for Support Vector Machine - SVM): $\max(0,\, 1 - yz)$
- Cross-Entropy Loss (for Neural Networks): $-\left[y \log(z) + (1 - y) \log(1 - z)\right]$
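The sketch below writes these four losses as Python functions of a prediction $z$ and an actual value $y$; the label conventions noted in the comments ($y \in \{-1, +1\}$ for the logistic and hinge losses, $y \in \{0, 1\}$ for cross-entropy) are standard assumptions rather than something stated above.

```python
import numpy as np

# Illustrative sketches of the loss functions listed above, written as
# functions of a prediction z and an actual value y.
def least_squares(z, y):
    return 0.5 * (y - z) ** 2

def logistic_loss(z, y):          # assumes y in {-1, +1}
    return np.log(1.0 + np.exp(-y * z))

def hinge_loss(z, y):             # assumes y in {-1, +1}
    return max(0.0, 1.0 - y * z)

def cross_entropy(z, y):          # assumes y in {0, 1} and z in (0, 1)
    return -(y * np.log(z) + (1.0 - y) * np.log(1.0 - z))
```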
The graphs associated with each loss function show how the error changes with respect to the prediction $z$ for different actual values $y$.
Cost Function
The cost function, denoted as $J(\theta)$, aggregates the losses across all training examples and is used to assess the performance of the model. It is defined as the sum of the individual loss values over all $m$ training examples:
$$J(\theta) = \sum_{i=1}^{m} L\!\left(h_\theta(x^{(i)}), y^{(i)}\right)$$
where $L$ is the chosen loss function, $h_\theta(x^{(i)})$ is the hypothesis evaluated on the $i$-th example, and $y^{(i)}$ is the corresponding actual value.
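A minimal sketch of this definition, assuming a generic `hypothesis` and `loss` supplied by the caller:

```python
# A minimal sketch of J(theta) as the sum of per-example losses.
# `hypothesis` and `loss` are placeholders for whichever model and loss
# function have been chosen; X holds the inputs and y the actual values.
def cost(theta, X, y, hypothesis, loss):
    return sum(loss(hypothesis(theta, x_i), y_i) for x_i, y_i in zip(X, y))
```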
This framework allows for the optimization of the model parameters $\theta$ through training, often using algorithms like gradient descent, with the goal of minimizing the cost function $J(\theta)$.
Optimization Algorithms in Machine Learning
Optimization algorithms are essential for finding the best parameters for machine learning models. These algorithms aim to minimize the cost function, which measures the prediction error of a model.
Gradient Descent
Gradient Descent is a foundational optimization method used to minimize the cost function by updating the parameters $\theta$ in the direction opposite to the gradient of the cost function, $\nabla_\theta J(\theta)$.
- Update Rule: $\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)$
- $\alpha$: the learning rate, a positive scalar determining the step size.
- $\nabla_\theta J(\theta)$: the gradient of the cost function with respect to the parameters.
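A minimal sketch of this update rule, assuming the gradient `grad_J` of the chosen cost function is available as a function; the step size and iteration count are illustrative defaults.

```python
import numpy as np

# A sketch of the update rule theta <- theta - alpha * grad_J(theta).
# `grad_J` is a placeholder for the gradient of the chosen cost function.
def gradient_descent(theta, grad_J, alpha=0.1, num_iters=100):
    for _ in range(num_iters):
        theta = theta - alpha * grad_J(theta)
    return theta

# Example: minimizing J(theta) = ||theta||^2, whose gradient is 2 * theta.
print(gradient_descent(np.array([3.0, -2.0]), lambda t: 2 * t))
```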
The graphical representation shows concentric contours of the cost function with the gradient pointing towards the direction of steepest ascent. Gradient descent moves in the opposite direction to reach the minimum.
- Stochastic Gradient Descent (SGD): updates the parameters after each individual training example.
- Batch Gradient Descent: updates the parameters once per step using the entire training set (or a batch of training examples).
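A sketch contrasting the two schemes, using least-squares linear regression as an assumed example model (not specified in the text above):

```python
import numpy as np

def batch_gd_step(theta, X, y, alpha):
    # one update based on the gradient over the entire training set
    grad = X.T @ (X @ theta - y) / len(y)
    return theta - alpha * grad

def sgd_epoch(theta, X, y, alpha):
    # one pass over the data, updating after each individual example
    for x_i, y_i in zip(X, y):
        grad = (x_i @ theta - y_i) * x_i
        theta = theta - alpha * grad
    return theta
```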
Likelihood
The likelihood function $L(\theta)$ measures how probable the observed data is, given a set of parameters $\theta$.
- Optimization Goal: $\theta^{\mathrm{opt}} = \underset{\theta}{\arg\max}\; L(\theta)$
- In practice, the log-likelihood $\ell(\theta) = \log L(\theta)$ is optimized instead, since it is easier to work with: products of probabilities become sums of logarithms.
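A small sketch, assuming a Bernoulli model for 0/1 observations (an illustrative choice), that maximizes the log-likelihood over a grid of candidate parameters:

```python
import numpy as np

# Observed 0/1 outcomes with unknown success probability p (assumed example).
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])

def log_likelihood(p, data):
    # the product of per-observation probabilities becomes a sum of logs
    return np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

grid = np.linspace(0.01, 0.99, 99)
p_opt = grid[np.argmax([log_likelihood(p, data) for p in grid])]
print(p_opt)  # close to the sample mean, the analytic maximizer here
```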
Newton's Algorithm
Newton's algorithm, also known as the Newton-Raphson method, is an optimization technique that finds the parameter $\theta$ by solving $\ell'(\theta) = 0$, where $\ell$ is typically a log-likelihood or loss function.
- Update Rule (Scalar Case): $\theta \leftarrow \theta - \dfrac{\ell'(\theta)}{\ell''(\theta)}$
- Update Rule (Multidimensional Generalization): $\theta \leftarrow \theta - \left(\nabla_\theta^2 \ell(\theta)\right)^{-1} \nabla_\theta \ell(\theta)$

Here, $\nabla_\theta^2 \ell(\theta)$ is the Hessian matrix of second-order partial derivatives. This method takes the curvature of $\ell$ into account, which can lead to faster convergence compared to gradient descent, especially in well-behaved quadratic problems.
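A minimal scalar sketch of Newton's update, applied to an assumed example function $\ell(\theta) = -(\theta - 3)^2$ with hand-coded derivatives:

```python
# Scalar sketch of the update theta <- theta - l'(theta) / l''(theta),
# applied to the assumed example l(theta) = -(theta - 3)^2.
def newton_scalar(theta, d1, d2, num_iters=10):
    for _ in range(num_iters):
        theta = theta - d1(theta) / d2(theta)
    return theta

d1 = lambda t: -2.0 * (t - 3.0)    # first derivative of l
d2 = lambda t: -2.0                # second derivative of l
print(newton_scalar(0.0, d1, d2))  # reaches 3.0 in one step on this quadratic
```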